ABSTRACT

We proposed a Least Information theory (LIT) to quantify 
meaning of information in probability distribution changes, 
from which a new information retrieval model was devel- 
oped. We observed several important characteristics of the 
proposed theory and derived two quantities in the IR con- 
text for document representation. Given probability distri- 
butions in a collection as prior knowledge, LI Binary (LIB) 
quantifies least information due to the binary occurrence of 
a term in a document whereas LI Frequency (LIF) measures 
least information based on the probability of drawing a term 
from a bag of words. Three fusion methods were also de- 
veloped to combine LIB and LIF quantities for term weight- 
ing and document ranking. Experiments on four benchmark 
TREC collections for ad hoc retrieval showed that LIT-based 
methods demonstrated very strong performances compared 
to classic TF*IDF and BM25, especially for verbose queries 
and hard search topics. The least information theory offers 
a new approach to measuring semantic quantities of infor- 
mation and provides valuable insight into the development 
of new IR models. 



Categories and Subject Descriptors 

H.3.1 [Information storage and retrieval]: Content Anal- 
ysis and Indexing 
tion Search Retrieval
1. INTRODUCTION

Shannon's mathematical theory of communication, commonly
known as information theory, has been used in a wide
spectrum of areas including digital coding, communication,
and information technology applications [22, 23].
Modeling information as reduction of entropy (uncertainty) 
provides a valuable vehicle in the design and engineering 
of information systems. In information retrieval (IR), in- 
formation and probability theories have provided important 
guidance to the development of classic techniques such as 
TF*IDF, probabilistic retrieval, and language modeling [19] . 

Despite its broad use, there are assumptions that define
the boundary of classic information theory, beyond which
its application requires careful examination of domain
contexts [20, 4]. The original purpose of Shannon's theory, as
noted in his masterpiece, was for engineering communication
systems where the "meaning of information was considered
irrelevant" [22, p. 379]. Information retrieval research
is centered around the notion of relevance, for which it is
crucial to decode meanings of information. To quantify the
"semantic amount" of information requires an extension of
Shannon's theory, better clarification of the relationship
between information and entropy, and justification of this
relationship [23]. Although various measures such as mutual
information and KL information (relative entropy) have been 
adopted, we observe that several important characteristics 
about an ideal information quantity in the IR context are 
yet to be met [TZirSl?] . 

In this article, we present the least information theory 
(LIT) which quantifies information required to explain prob- 
ability distribution changes. The theory extends Shannon's 
theory by going beyond the entropy-reduction notion of in- 
formation. Similar to relative entropy, the proposed quantity 
is a non-linear function of entropy and emphasizes meaning 
in probabilities of inferences. The formulation removes
assumptions in existing models that are unnecessary in the
IR context and meets several characteristics expected of
such a measure. We applied the new theory in modeling ad hoc
retrieval and showed strong experimental results compared 
to classic TF*IDF and Okapi BM25 on four benchmark IR 
collections. 



2. PROPOSED THEORY 

In this section, we propose a new theory to quantify mean- 
ing of information via an extension of Shannon's entropy 
equation. We start with an example to motivate discussions 
on what to expect about the theory and introduce the least 
information theory in which expected characteristics are
observed.

2.1 A Motivating Example 

Let's start the discussion with a simple binary case.
Suppose there are two exhaustive and mutually exclusive
inferences A and B on a given hypothesis, with probabilities
$p_a$ and $p_b$ respectively (e.g., the likelihood of each
candidate winning an election in a one-on-one race). Given the
probability distribution, it is straightforward to measure the
uncertainty of the inference system using Shannon's entropy
formula: $H = -k \sum p \ln p$. When the outcome is known, the
uncertainty is reduced to zero and the amount of (missing)
information, according to Shannon, can be taken as the
reduction of the uncertainty [22]. This entropy-based measure
essentially determines the amount of missing information
given a specified distribution regardless of the ultimate
outcome [3].

However, the notion of information as a linear function of 
reducing uncertainty has counterintuitive implications when 
the meaning of the outcome is taken into account. Suppose $p_a$
is much larger than $p_b$ (e.g., candidate A is more likely to
win the election). Intuitively speaking, the outcome of B 
being the correct inference appears to require more infor- 
mation for explanation than does the ultimate inference of 
A - for example, the less likely (weaker) candidate winning 
an election is bigger news and requires more explanation 
than otherwise. 
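
The point can be checked numerically. The following is a minimal sketch, assuming natural logarithms, a constant k = 1, and an illustrative prior of 0.9 vs. 0.1 that does not come from the paper: the entropy reduction is identical whichever candidate wins.

```python
import math

def shannon_entropy(probs, k=1.0):
    """H = -k * sum(p ln p); zero-probability terms contribute nothing."""
    return -k * sum(p * math.log(p) for p in probs if p > 0)

# Illustrative prior (an assumption): candidate A is the clear favourite.
prior = [0.9, 0.1]
h_prior = shannon_entropy(prior)

# Once the result is known, the distribution is certain whoever won,
# so the entropy after the outcome is zero in both cases.
reduction_if_a_wins = h_prior - shannon_entropy([1.0, 0.0])
reduction_if_b_wins = h_prior - shannon_entropy([0.0, 1.0])

print(reduction_if_a_wins, reduction_if_b_wins)
```

Both printed values are the same, which is exactly the insensitivity to the outcome that the example above finds counterintuitive.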

2.2 Model Expectation 

If information is a function of linear uncertainty reduction, 
whatever the outcome is has no influence on the amount 
of information that explains the outcome, which is against 
our intuition. In the special case of the above example, the 
amount of information should not only depend on the un- 
certainty of inferences but also the ultimate outcome (the 
correct inference). Furthermore, we reason that, while
uncertainty depends only on a specified probability
distribution, the amount of information required to explain the
outcome, and more generally to explain a probability
distribution change, is beyond a linear function of uncertainty.

Indeed, using Shannon's entropy measure to quantify the
amount of meaningful information is beyond the scope of
classic information theory. The original purpose of
Shannon's theory, as noted in his masterpiece, was for
engineering communication systems where the "meaning of
information was considered irrelevant" [22, p. 379]. Information
retrieval is centered on the notion of relevance, which has
an important semantic (meaning) dimension. Measuring
"semantic quantities" of information requires an extension
of Shannon's theory, better clarification of the relationship
between information and entropy, and justification of this
relationship. Efforts have been made, with limited progress,
to identify meaning quantitatively [23].

While theories such as KL information (relative entropy)
offer alternatives to the simplified entropy reduction view of 
information, some characteristics of relative entropy do not 
meet our expectations about such a measure. Specifically, 
the asymmetry of the KL function is due to an assumption 
about one distribution being truer than the other, which 
is not necessarily realistic. In addition, relative entropies 
over the course of continuous probability changes in one di- 
rection do not add up to the overall amount. Finally and 
very importantly, extreme probability changes (e.g., when a
probability changes from a tiny value to nearly 1) lead to
infinite KL information, which is a particularly undesirable 
property for term weighting in information retrieval. 
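
These three properties of relative entropy can be illustrated with a short sketch. It assumes the standard definition KL(Y || X) = sum of y ln(y / x) with natural logarithms; the distributions below are arbitrary illustrative values, not data from the experiments.

```python
import math

def kl_divergence(posterior, prior):
    """KL(posterior || prior) = sum(y * ln(y / x)), natural log.

    Assumes every prior probability is strictly positive.
    """
    return sum(y * math.log(y / x) for x, y in zip(prior, posterior) if y > 0)

# Three illustrative distributions moving steadily in one direction.
x = [0.2, 0.8]
y = [0.6, 0.4]
z = [0.9, 0.1]

# Asymmetry: swapping the roles of the two distributions changes the value.
forward = kl_divergence(y, x)
backward = kl_divergence(x, y)

# Non-additivity: two steps in the same direction do not sum to the whole.
stepwise = kl_divergence(y, x) + kl_divergence(z, y)
direct = kl_divergence(z, x)

# Growth without bound: a probability moving from eps to 1 - eps.
blow_up = [kl_divergence([1 - eps, eps], [eps, 1 - eps])
           for eps in (1e-3, 1e-6, 1e-9)]

print(forward, backward)
print(stepwise, direct)
print(blow_up)
```

The last list keeps growing as eps shrinks, which is the behaviour that makes unsmoothed relative entropy awkward as a term weight.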

2.3 Least Information (LI) 

In this section, we present the proposed least information
theory. Let X be prior (initially specified) probabilities for
a set of exhaustive and mutually exclusive inferences:
$X = [x_1, x_2, \ldots, x_n]$, where $x_i$ is the prior
probability of the $i$th inference on a given hypothesis. Let Y
denote posterior (changed) probabilities after certain
information is known: $Y = [y_1, y_2, \ldots, y_n]$, where $y_i$
is the informed probability of the $i$th inference.
Uncertainties of the two distributions are computed by
Shannon entropy:

$$H(X) = -k \sum_{i=1}^{n} x_i \ln x_i \tag{1}$$

$$H(Y) = -k \sum_{i=1}^{n} y_i \ln y_i \tag{2}$$

The amount of information obtained from X to Y , in 
Shannon's treatment, can be measured via the reduction 
of entropy: 

$$\Delta H = H(Y) - H(X) \tag{3}$$

Inferences are semantically exclusive and involve different 
meanings. When probabilities vary from X to Y, the two 
distributions are semantically different and it is obvious that 
some amount of information is responsible for the variance. 
Therefore, we need to examine the amount of information 
associated with individual inferences via the measurement of 
uncertainty change. With Equation 3, however, it is easy to
show that when there are changes in the probabilities, there 
may be increases, decreases, or no change in the overall un- 
certainty. We observe that even when there is no change in 
the entropy, there is still an amount of information respon- 
sible for any variance in the probability distribution. To 
use the overall (system-wide) uncertainty for the measure- 
ment of information ignores semantic relevance of changes 
in individual inferences. 
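
A small sketch makes this concrete. Assuming natural logarithms, k = 1, and illustrative two-inference distributions chosen for this example only, the entropy change can be positive, negative, or zero even though the probabilities clearly moved in every case.

```python
import math

def shannon_entropy(probs, k=1.0):
    """H = -k * sum(p ln p); zero-probability terms contribute nothing."""
    return -k * sum(p * math.log(p) for p in probs if p > 0)

x = [0.9, 0.1]  # illustrative prior

# The favourite and the underdog trade places: a large semantic change.
delta_swap = shannon_entropy([0.1, 0.9]) - shannon_entropy(x)

# The race becomes a toss-up: entropy increases.
delta_flatten = shannon_entropy([0.5, 0.5]) - shannon_entropy(x)

# The favourite pulls further ahead: entropy decreases.
delta_sharpen = shannon_entropy([0.99, 0.01]) - shannon_entropy(x)

print(delta_swap, delta_flatten, delta_sharpen)
```

The swap case yields an entropy change of zero although the distribution has changed about as much as it can, which is why a system-wide uncertainty measure alone cannot account for the information involved.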

Here our new least information model departs from the 
classic measure of information as reduction of uncertainty 
(entropy). First, we reason that a change in the uncertainty 
of an inference, either an increase or decrease, requires a rel- 
evant amount of information that is semantically responsible 
for it. The overall information needed to explain changes in 
all inference probabilities is the sum of individual pieces of 
information associated with each inference. 

Second, for an individual inference i, the probability may 
vary in one of the two semantic directions, i.e., to increase 
or to decrease it. In either case, there is always a (positive) 
amount of information responsible for that variance. If we 
assume inferences are semantically independent, the abso- 
lute values of these independent pieces of information add 
linearly to the overall amount of information. 

1 Inference probabilities are never perfectly independent of
one another given the degrees of freedom. But to simplify
the discussion and formulation, we adopt the independence
assumption.

In addition, it is reasonable for such an information
quantity to meet the condition that continuous, smaller changes
in one direction should add incrementally to a bigger change
in the same direction. That is, pieces of information
responsible for small, continuous changes of an inference
probability in the same direction should add up to the amount of
information for the overall change. For example, if the $i$th
inference's probability increases from $x_i$ to $y_i$ and then to
$z_i$, the least amount of information required for the change
from $x_i$ to $y_i$ and the amount from $y_i$ to $z_i$ should add
up to the overall least information required for the change from
$x_i$ to $z_i$. We define $dH_i$ as the amount of entropy change
due to a tiny change $dp_i$ of probability $p_i$:

$$dH_i = -\ln p_i \, dp_i \tag{4}$$

In the configuration view of entropy, this microscopic
variance of entropy due to a small change in an inference's
probability is the change of the weighted ($p_i$) number of
configurations ($\ln \frac{1}{p_i}$) [3]. In other words, it is
the change in the number of configurations ($\ln \frac{1}{p_i}$)
due to a varied probability weight ($p_i$).

Every tiny change in the probabilities requires some expla- 
nation (information). Aggregating (integrating) the small 
changes of uncertainty leads to the amount of information re- 
quired for a macro-level change. A macroscopic uncertainty 
change due to a significant probability shift of an inference 
is therefore the sum (integration) of continuous microscopic 
changes in the variance range. Therefore, we define the least
amount of information $I_i$ required to explain the probability
change of the $i$th inference as the integration (aggregation)
of all tiny absolute (positive) changes of entropy $dH_i$:

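\begin{align}
I_i &= \left| \int_{x_i}^{y_i} dH_i \right| \tag{5} \\
    &= \left| \Big[ p_i (1 - \ln p_i) \Big]_{x_i}^{y_i} \right| \tag{6}
\end{align}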

We define informative entropy $g_i$ as a function of an
inference's probability:

$$g_i = p_i (1 - \ln p_i) \tag{7}$$
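
As a quick check on this closed form, the following worked integration is a sketch that assumes the natural logarithm and takes the microscopic entropy change to be $-\ln p \, dp$, as in Equation 4. It shows why integrating that change between a prior probability $x_i$ and a posterior probability $y_i$ reduces to a difference of two $p(1 - \ln p)$ values.

```latex
\begin{align*}
\int_{x_i}^{y_i} -\ln p \, dp
  &= \Big[ -(p \ln p - p) \Big]_{x_i}^{y_i}
   = \Big[ p \, (1 - \ln p) \Big]_{x_i}^{y_i} \\
  &= y_i (1 - \ln y_i) - x_i (1 - \ln x_i).
\end{align*}
```

Because $-\ln p \ge 0$ on $(0, 1]$, the quantity $p(1 - \ln p)$ never decreases over that interval: it rises from a limit of 0 as $p \to 0$ to exactly 1 at $p = 1$.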

The equation for least information $I_i$ for the $i$th inference
can be rewritten as:

$$I_i = |g(y_i) - g(x_i)| \tag{8}$$

The total Least Information I is the sum of partial least 
information in every inference: 

\begin{align}
I &= \sum_{i=1}^{n} I_i \tag{9} \\
  &= \sum_{i=1}^{n} |g(y_i) - g(x_i)| \tag{10} \\
  &= \sum_{i=1}^{n} |y_i (1 - \ln y_i) - x_i (1 - \ln x_i)| \tag{11}
\end{align}

where $n$ is the number of inferences, $x_i$ is the initially
specified probability of the $i$th inference, and $y_i$ the
revised probability of the $i$th inference.
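
The total least information can be sketched in a few lines of code. This is a minimal illustration rather than the authors' implementation; the function names are hypothetical, natural logarithms are assumed, and g(0) is taken as its limit value of 0.

```python
import math

def informative_entropy(p):
    """g(p) = p * (1 - ln p), with g(0) taken as the limit value 0."""
    if p <= 0.0:
        return 0.0
    return p * (1.0 - math.log(p))

def least_information(prior, posterior):
    """I = sum_i |g(y_i) - g(x_i)| over all inferences."""
    if len(prior) != len(posterior):
        raise ValueError("distributions must cover the same inferences")
    return sum(abs(informative_entropy(y) - informative_entropy(x))
               for x, y in zip(prior, posterior))

# Illustrative election example: candidate A is the favourite at 0.9.
prior = [0.9, 0.1]
favourite_wins = least_information(prior, [1.0, 0.0])
underdog_wins = least_information(prior, [0.0, 1.0])

print(favourite_wins, underdog_wins)
```

With this prior the favourite winning needs roughly 0.34 units of least information while the underdog winning needs roughly 1.66, so the measure depends on the outcome in the way the motivating example calls for.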



2.4 Important Model Characteristics 

It is worth noting that Equation 11 measures the
least amount of information required to explain a probabil- 
ity distribution change for a set of inferences. Given that 
information may alter a probability distribution in various 
semantic directions and change the uncertainty in both pos- 
itive and negative directions, the actual amount of informa- 
tion leading to such a change may consist of multiple pieces 
of information acting on different directions. 

Without an exhaustive analysis of the process, the actual 
amount of information cannot be deduced solely from an in- 
vestigation of probability distributions. It is only reasonable 
to quantify the least information needed for that change - 
that is, the sum of all needed amounts of information at the 
very least, every tiny piece of which contributes in the same 
direction of a change. In addition, this model does not con- 
sider the process of removing information, which, in effect, 
is equivalent to adding another piece of information that has 
perfectly opposite semantics$^2$ in the same amount.




[Figure 1: plot omitted. x-axis: P1, probability for inference 1
(of 2 mutually exclusive inferences).]

Figure 1: Least Information vs. Entropy: Reducing two
exclusive uncertain inferences to certainty. Log functions in
equations use the natural base. The asymmetry of least
information in the plot is a manifestation of its dependence
on the outcome.

Based on Equation 11, several important characteristics
of least information can be observed. Figure 1 compares
the least information measure with entropy reduction in a 
two-exclusive-inference case. We summarize some of these 
characteristics below. 

• Absolute information and symmetry: The amount of 
least information required for a probability change from 
X to Y is the same as that from Y to X, though their 
semantic meanings are different. 

• Addition of continuous change: Amounts of least
information for small, continuous probability changes in the
same semantic direction add linearly to the amount
of least information responsible for the overall change.
In short, $I(X \to Z) = I(X \to Y) + I(Y \to Z)$, if and
only if $X \to Y$ and $Y \to Z$ are in the same semantic
direction.

• Unit Information: In the special case when there are
two equally possible inferences, the amount of least
information needed to explain an outcome (certainty)
is one: $I(p_1 = p_2 = \frac{1}{2} \to p_1 = 1) = 1$, regardless
of the log base in the equation (see Figure 1).

• In the special case of reducing uncertain inferences to
certainty (the ultimate case):

- With equally likely inferences, when there are more
choices, the least information needed to explain
an outcome is larger.

- The less likely the outcome, the larger the amount
of least information needed to explain it.

• Zero least information: The amount of least
information is zero if and only if there is no change in the
probability distribution.

2 The term opposite does not indicate true vs. false
information. Opposite information semantics can be seen, in a
sense, as good news vs. bad news.
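
The characteristics listed above can be spot-checked numerically. The sketch below assumes natural logarithms and uses hypothetical helper names and arbitrary example distributions; it tests particular cases only and is not a proof of the general statements.

```python
import math

def g(p):
    """Informative entropy g(p) = p * (1 - ln p); g(0) is taken as 0."""
    return p * (1.0 - math.log(p)) if p > 0 else 0.0

def least_information(prior, posterior):
    """I = sum_i |g(y_i) - g(x_i)|."""
    return sum(abs(g(y) - g(x)) for x, y in zip(prior, posterior))

def uniform_to_certainty(n):
    """Least information to resolve n equally likely inferences."""
    prior = [1.0 / n] * n
    outcome = [1.0] + [0.0] * (n - 1)
    return least_information(prior, outcome)

# Three distributions moving in one semantic direction.
x, y, z = [0.2, 0.8], [0.5, 0.5], [0.8, 0.2]

symmetric = math.isclose(least_information(x, y), least_information(y, x))
additive = math.isclose(least_information(x, z),
                        least_information(x, y) + least_information(y, z))
unit = uniform_to_certainty(2)
more_choices = [uniform_to_certainty(n) for n in (2, 3, 4, 5)]
likely = least_information([0.9, 0.1], [1.0, 0.0])
unlikely = least_information([0.9, 0.1], [0.0, 1.0])
unchanged = least_information(x, x)

print(symmetric, additive, unit, more_choices, likely, unlikely, unchanged)
```

Reversing direction breaks additivity, as the second bullet implies: going from x to z and back to x costs twice the one-way amount, while the net change from x to x costs nothing.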

2.5 Least Information Modeling for IR 

Now we apply the proposed least information theory (LIT) 
to information retrieval (IR) for term weighting and docu- 
ment ranking. With a focus on quantifying semantics of 
information, the least information measure is theoretically 
compatible with the central problem in IR, which is about 
semantic relevance. 

In the bag-of-words approach to IR, a document can be 
viewed as a set of terms with probabilities (estimated by 
frequencies) of occurrence. While the entire collection rep- 
resents the domain in which searches are conducted, each 
document contains various pieces of information which dif- 
ferentiate itself from other documents in the domain. By 
analyzing a term's probability (frequency) in a document 
vs. that in the collection, we can compute information pre- 
sented by the document in the term to weight the term. In 
other words, taking domain distributions as prior knowledge, 
we can measure the amount of least information conveyed 
by a specific document when it is observed. 

In particular, we conjecture that the larger the amount of least
information needed to explain a term's probability in a
document, the more heavily the term should be weighted to 
represent the document. Hence, we transform the question 
of document representation into weighting terms according 
to their amounts of least information in documents. In this 
study, we propose two specific weighting methods, one based 
on a binary representation of term occurrence (0 vs. 1) and 
the other based on term frequencies. These two methods 
will be used separately and combined in fusion methods as 
well. 

2.5.1 LI Binary (LIB) Model

In the binary model, a term either occurs or does not
occur in a document. If we randomly pick a document from the
collection, the chance that a term $t_i$ appears in the document
can be estimated by the ratio between the number of
documents containing the term, $n_i$ (i.e., document frequency),
and the total number of documents $N$. Let $p(t_i|C) = n_i/N$
denote the probability of term $t_i$ occurring in a randomly
picked document in collection $C$; $p(\bar{t}_i|C)$ is the
probability that the term does not appear:

$$p(\bar{t}_i|C) = 1 - p(t_i|C) = 1 - n_i/N$$

When a specific document $d$ is observed, it becomes
certain whether a term occurs in the document or not. Hence
the term probability given a specific document, $p(t_i|d)$, is
either 1 or 0. Given the definition of $g_i$ in Equation 7,
the least amount of information in term $t_i$ from observing
document $d$ can be computed by:

$$I(t_i, d) = |g(t_i|d) - g(t_i|C)| + |g(\bar{t}_i|d) - g(\bar{t}_i|C)| \tag{12}$$

The above equation gives the amount of information a 
term conveys in a document regardless of its semantic direc- 
tion. When a query term ti does not appear in document 
d, the least information associated with the term should be 
treated as negative because it makes the document less rel- 
evant to the term. Hence, the ranking function should not 
only consider the amount of information but also the sign 
(positive vs. negative) of the quantity. Hence, LI Binary 
(LIB) is computed by: 

$$LIB_2(t_i, d) = g(t_i|d) - g(t_i|C) - \left[ g(\bar{t}_i|d) - g(\bar{t}_i|C) \right] \tag{13}$$

Keeping only quantities related to $t_i$ (and removing those
associated with $\bar{t}_i$), we simplify the LIB equation to:

\begin{align}
LIB(t_i, d) &= g(t_i|d) - g(t_i|C) \tag{14} \\
            &= g(t_i|d) - \frac{n_i}{N} \left( 1 - \ln \frac{n_i}{N} \right) \tag{15}
\end{align}

The total least information of all query terms in the doc- 
ument d is computed by: 

$$LIB(q, d) = \sum_{t_i \in q} LIB(t_i, d) \tag{16}$$

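The quantity $LIB(t_i, d)$ depends on the observation of
term $t_i$ in the document: $g(t_i|d)$ is 1 when $t_i$ appears in
document $d$ and 0 otherwise, according to Equation 7.
That is:

$$LIB(t_i, d) = \begin{cases}
1 - \frac{n_i}{N} \left( 1 - \ln \frac{n_i}{N} \right) & t_i \in d \\
-\frac{n_i}{N} \left( 1 - \ln \frac{n_i}{N} \right) & t_i \notin d
\end{cases} \tag{17}$$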

where $n_i$ is the document frequency of term $t_i$ and $N$ is
the total number of documents. The larger the LIB, the
more information the term contributes to the document and
the more heavily it should be weighted in the document
representation. LIB is similar in spirit to IDF and its value
represents the discriminative power of the term when it appears
in a document.
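
A minimal sketch of the simplified LIB weight and the query-level sum follows. The function names, the toy document frequencies, and the choice to treat an unseen term as having document frequency 0 are assumptions made for illustration; the paper's Lucene implementation is not shown in the text.

```python
import math

def g(p):
    """Informative entropy g(p) = p * (1 - ln p); g(0) is taken as 0."""
    return p * (1.0 - math.log(p)) if p > 0 else 0.0

def lib_weight(doc_freq, num_docs, term_in_doc):
    """LIB(t_i, d) = g(t_i|d) - g(n_i / N)."""
    presence = 1.0 if term_in_doc else 0.0  # g(1) = 1 and g(0) = 0
    return presence - g(doc_freq / num_docs)

def lib_score(query_terms, doc_terms, doc_freqs, num_docs):
    """LIB(q, d): sum of LIB weights over the query terms."""
    return sum(lib_weight(doc_freqs.get(t, 0), num_docs, t in doc_terms)
               for t in query_terms)

# Toy statistics (assumed): 1000 documents, one rare and one common term.
num_docs = 1000
doc_freqs = {"entropy": 10, "the": 990}

rare_hit = lib_weight(doc_freqs["entropy"], num_docs, True)
common_hit = lib_weight(doc_freqs["the"], num_docs, True)
rare_miss = lib_weight(doc_freqs["entropy"], num_docs, False)
score = lib_score(["entropy", "the"], {"entropy", "of", "the"},
                  doc_freqs, num_docs)

print(rare_hit, common_hit, rare_miss, score)
```

A rare term that is present scores close to 1, a near-ubiquitous term that is present scores close to 0, and a missing term contributes a negative amount, which mirrors the IDF-like behaviour described above.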

2.5.2 LI Frequency (LIF) Model 

In the LI Frequency (LIF) model, we use term frequencies to
model least information. Treating a document collection $C$
as a meta-document, the probability of a randomly picked
term from the collection being a specific term $t_i$ can be
estimated by: $p(t_i|C) = F_i/L$, where $F_i$ is the total number
of occurrences of term $t_i$ in collection $C$ and $L$ the overall
length of $C$ (i.e., the sum of all document lengths).

When a specific document $d$ is observed, the probability
of picking term $t_i$ from this document can be estimated by:
$p(t_i|d) = tf_{i,d}/L_d$, where $tf_{i,d}$ is the number of times
term $t_i$ occurs in document $d$ and $L_d$ is the length of the
document. Again, for each term $t_i$, there are two exclusive
inferences, namely the randomly picked term being the specific
term ($t_i$) or not ($\bar{t}_i$). To quantify a term's LIF weight,
we measure least information that explains the change from the
term's probability distribution in the collection to its
distribution in the document in question:



$$LIF_2(t_i, d) = g(t_i|d) - g(t_i|C) + g(\bar{t}_i|C) - g(\bar{t}_i|d) \tag{18}$$



We focus on the quantities $g(t_i|d)$ and $g(t_i|C)$ to estimate
least information of each term when a specific document is
observed. Without quantities $g(\bar{t}_i|C)$ and $g(\bar{t}_i|d)$,
LIF is computed by:



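\begin{align}
LIF(t_i, d) &= g(t_i|d) - g(t_i|C) \tag{19} \\
            &= \frac{tf_{i,d}}{L_d} \left( 1 - \ln \frac{tf_{i,d}}{L_d} \right)
             - \frac{F_i}{L} \left( 1 - \ln \frac{F_i}{L} \right) \tag{20}
\end{align}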



Hence, the LI Frequency (LIF) ranking score can be com- 
puted by the sum of least information in all query terms: 



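\begin{align}
LIF(q, d) &= \sum_{t_i \in q} \left[ g(t_i|d) - g(t_i|C) \right] \tag{21} \\
          &= \sum_{t_i \in q} \left[ \frac{tf_{i,d}}{L_d} \left( 1 - \ln \frac{tf_{i,d}}{L_d} \right)
           - \frac{F_i}{L} \left( 1 - \ln \frac{F_i}{L} \right) \right] \tag{22}
\end{align}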



where $tf_{i,d}$ is the term frequency of term $t_i$ in document $d$
and $L_d$ is the document length. $F_i$ is the collection frequency
of term $t_i$ (sum of term frequencies in all documents) whereas
$L$ is the overall length of all documents.

In a sense, LIF can be seen as a new approach to mod- 
eling term frequencies with document length and collection 
frequency normalization. In this study, we use raw term 
frequencies to estimate probabilities and do not use any 
smoothing techniques to fine tune the estimates. 
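
The LIF score can be sketched in the same way, using raw frequencies with no smoothing as stated above. The function names and the three-document toy collection are assumptions for illustration, and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def g(p):
    """Informative entropy g(p) = p * (1 - ln p); g(0) is taken as 0."""
    return p * (1.0 - math.log(p)) if p > 0 else 0.0

def lif_weight(tf, doc_len, coll_freq, coll_len):
    """LIF(t_i, d) = g(tf_{i,d} / L_d) - g(F_i / L)."""
    return g(tf / doc_len) - g(coll_freq / coll_len)

def lif_score(query_terms, doc_tokens, coll_freqs, coll_len):
    """LIF(q, d): sum of LIF weights over the query terms."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)  # assumed to be greater than zero
    return sum(lif_weight(counts[t], doc_len, coll_freqs.get(t, 0), coll_len)
               for t in query_terms)

# Toy collection (assumed) treated as one meta-document.
docs = [
    ["least", "information", "theory"],
    ["information", "retrieval", "model", "information"],
    ["term", "weighting", "model"],
]
coll_freqs = Counter(token for doc in docs for token in doc)
coll_len = sum(coll_freqs.values())

dense = lif_score(["information"], docs[1], coll_freqs, coll_len)
absent = lif_score(["information"], docs[2], coll_freqs, coll_len)

print(dense, absent)
```

A document that uses the term more densely than the collection as a whole gets a positive weight, while a document that lacks the term gets the negative of the collection-level quantity.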

2.5.3 Fusion of LIB & LIF

While LIB uses binary term occurrence to estimate least 
information a document carries in the query terms, LIF mea- 
sures the information based on term frequency. The two are 
related quantities with different focuses. As discussed, the 
LIB quantity is similar in spirit to IDF (inverse document 
frequency) whereas LIF can be seen as a means to normalize 
TF (term frequency). 

In light of TF*IDF, we reason that combining the two 
will potentiate each quantity's strength for term weighting, 
ultimately leading to improved document ranking. Hence we
propose three fusion methods to combine the two quantities
by addition and multiplication: 

1. LIB+LIF: To weight a term, we simply add LIB and 
LIF together by treating them as two separate pieces 
of information. The ranking score of a document is 
then the sum of all LIB+LIF quantities in the query 
terms. 

2. LIB*LIF: In this fusion method, we follow the idea 
of TF*IDF by multiplying LIB and LIF quantities for 
each term. Because individual least information values 
fall in the range of [-1, 1] and can be negative, we
normalize LIB and LIF values to [0, 2] by adding 1 to 
each before multiplication. Again, document ranking 
is then based on the linear sum of LIB*LIF quantities 
in the query terms. 

3. LICos: This method combines LIB+LIF with cosine 
similarity. We use LIB+LIF for term weights to repre- 
sent documents in VSM (vector space model) and rank 
documents based on their Cosine coefficients with the 
binary vector representation of a query. 
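
The three fusion methods can be sketched as follows. This is an illustration under stated assumptions, not the authors' Lucene code: all function names and the toy collection are hypothetical, and for LICos the document vector is assumed to cover only the terms that occur in the document, a detail the description above leaves open.

```python
import math
from collections import Counter

def g(p):
    """Informative entropy g(p) = p * (1 - ln p); g(0) is taken as 0."""
    return p * (1.0 - math.log(p)) if p > 0 else 0.0

def collection_stats(docs):
    """Gather document frequencies, collection frequencies, N and L."""
    df, cf = Counter(), Counter()
    for tokens in docs.values():
        cf.update(tokens)
        df.update(set(tokens))
    return {"df": df, "cf": cf, "N": len(docs), "L": sum(cf.values())}

def lib_lif(term, counts, doc_len, stats):
    """Return the (LIB, LIF) pair for one term in one document."""
    tf = counts[term]  # a Counter returns 0 for missing terms
    lib = (1.0 if tf > 0 else 0.0) - g(stats["df"][term] / stats["N"])
    lif = g(tf / doc_len) - g(stats["cf"][term] / stats["L"])
    return lib, lif

def score_sum(query, counts, doc_len, stats):
    """LIB+LIF: add the two quantities for every query term."""
    return sum(sum(lib_lif(t, counts, doc_len, stats)) for t in query)

def score_product(query, counts, doc_len, stats):
    """LIB*LIF: shift each quantity by 1 into [0, 2], then multiply."""
    total = 0.0
    for t in query:
        lib, lif = lib_lif(t, counts, doc_len, stats)
        total += (lib + 1.0) * (lif + 1.0)
    return total

def score_cosine(query, counts, doc_len, stats):
    """LICos: cosine between LIB+LIF document weights and a binary query."""
    # Assumption: only terms present in the document enter its vector.
    weights = {t: sum(lib_lif(t, counts, doc_len, stats)) for t in counts}
    doc_norm = math.sqrt(sum(w * w for w in weights.values()))
    query_terms = set(query)
    if doc_norm == 0.0 or not query_terms:
        return 0.0
    dot = sum(weights.get(t, 0.0) for t in query_terms)
    return dot / (doc_norm * math.sqrt(len(query_terms)))

# Toy collection and query (assumed values for illustration only).
docs = {
    "d1": ["least", "information", "retrieval", "model"],
    "d2": ["information", "theory", "entropy", "entropy"],
    "d3": ["term", "weighting", "model", "model"],
}
stats = collection_stats(docs)
query = ["information", "retrieval"]

def rank(score_fn):
    """Order document names by descending score under one fusion method."""
    scored = {name: score_fn(query, Counter(tokens), len(tokens), stats)
              for name, tokens in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)

rankings = {fn.__name__: rank(fn)
            for fn in (score_sum, score_product, score_cosine)}
print(rankings)
```

On this toy data all three methods place the document containing both query terms first and the document containing neither last; differences between the methods would only show up on larger, less clear-cut collections.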

These fusion methods allow us to examine potential strengths 
and weaknesses of the proposed least information modeling 
for IR. We study LIB and LIF as well as the above fu- 
sion methods in experiments. And given the effectiveness 
of TF*IDF and especially its BM25 variation in traditional 
ad hoc retrieval experiments, we use them as baselines in 
the experiments. 

3. EXPERIMENTAL SETUP 
3.1 Data Collections and Topics 

We used the following data sets from the Linguistic Data 
Consortium and NIST for retrieval experiments: the TIP- 
STER corpus (Disks 2 and 3), TREC Disks 4 and 5, and the 
AQUAINT I corpus (roughly a million news documents from 
New York Times, AP, and Xinhua [27]). These data had 
been widely used in TREC for ad hoc retrieval experiments. 
We relied on the following TREC topics and relevance bases 
for IR evaluation: 

• TREC 2 routing topics 51 - 100 with title, description, 
summary, narrative, and concepts (disk 3) [21];

• TREC 4 ad hoc topics 201 - 250 with natural language 
descriptions only (disks 2 and 3) [6]; 

• TREC 7 ad hoc topics 351 - 400 with title, description, 
and narrative (disks 4 and 5 minus the Congressional 
Record) [28]; 

• TREC 2005 HARD/Robust 50 topics with title, de- 
scription, and narrative ranging from 303 - 689 (AQUAINT 
I data) [57]. 

These collections represent a diversity of text data and 
query tasks. In TREC 2, for example, the concepts field in 
51 - 100 topics contains a verbose list of concepts to repre- 
sent each search topic. Text queries automatically generated 
from the concept lists are likely to be more accurate than 
general descriptions in sentences. On the other hand, TREC 
4 topics 201 - 250 only have natural language descriptions of
queries. TREC 2005 HARD and Robust topics were developed
as a list of difficult topics from previous years' ad hoc
experiments. Using these diverse data and topics enabled a 
relatively thorough examination of the proposed methods' 
effectiveness in various domain and task contexts. 

3.2 Experimental System 

We implemented the retrieval ranking methods using the 
Lucene core search engine library in Java [7]. We reused 
the Okapi BM25 implementation reported in [16] and vali- 
dated by [19], which achieved highly competitive results in 
recent years' TREC competitions. We set parameter values
$b = 0.75$ and $k_1 = 1.5$ for BM25, according to existing
research on related data. In addition, we developed
the following proposed methods for Lucene scoring
(ranking): LIB, LIF, LIB+LIF, LIB*LIF, and LICos. Two classic
TF*IDF methods, one with document length normalization
(TF_N*IDF) and the other without (TF*IDF), were also
implemented as baselines. We performed standard tokenization,
casefolding, and stop-word removal for indexing. For
each data collection, one set of experiments was conducted
with stemming and the other without it.

3.3 Evaluation Metrics 

We used human relevance judgment (QRELs) developed 
for TREC 2, TREC 4, TREC 7, and TREC 2005 HARD 
(Robust) tracks as the gold standard for each set of ex- 
periments. We compared the proposed methods with clas- 
sic TF*IDF and Okapi BM25 methods. Evaluation metrics 
included mean average precision with arithmetic averaging
(MAP) and geometric averaging (gMAP), precision at rank 10
(P10), normalized discounted cumulative gain at 10 (nDCG_10),
and recall precision (R_PR). While arithmetic average MAP
provides a simple mean score across multiple queries, the
geometric average (gMAP) is sensitive to poorly performing
tasks and is a very useful metric developed for the 2005 HARD
track [27]. NDCG favors early retrieval of highly relevant
documents in a ranked list and has become widely adopted 
for ranked retrieval evaluation [8]. 
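
The difference between MAP and gMAP can be seen in a short sketch. The function names, the per-topic scores, and the small floor applied before taking logarithms are assumptions for illustration; the text does not state how zero scores were handled in the evaluation.

```python
import math

def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at the ranks of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(ap_scores):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(ap_scores) / len(ap_scores)

def geometric_map(ap_scores, floor=1e-5):
    """Geometric mean of per-topic average precision (gMAP).

    The floor is an assumed guard so that a zero score does not
    collapse the whole product to zero.
    """
    logs = [math.log(max(ap, floor)) for ap in ap_scores]
    return math.exp(sum(logs) / len(logs))

# One topic: relevant documents found at ranks 2 and 4.
ap_example = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"})

# Hypothetical per-topic scores for two systems.
system_a = [0.9, 0.9, 0.01]  # excellent on two topics, fails the third
system_b = [0.6, 0.6, 0.5]   # steadier across all three topics

print(ap_example)
print(mean_average_precision(system_a), mean_average_precision(system_b))
print(geometric_map(system_a), geometric_map(system_b))
```

System A has the higher MAP but the lower gMAP, because the geometric mean is pulled down sharply by its one failed topic.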

4. EXPERIMENTAL RESULTS 

Figure 2 provides an overview of major results. From
the plots, the proposed LICos method appeared to have
achieved the best results and was better than BM25 in most
experiments. All LIB-related methods such as LIB, LIB+LIF,
and LIB*LIF overwhelmingly outperformed TF*IDF methods,
especially in TREC 2, TREC 7, and TREC 2005 HARD/Robust
experiments. In many cases, the LIB-related methods were
more than 100% better than TF*IDF baselines (i.e., relative
scores > 2). LICos consistently outperformed BM25
in terms of gMAP in all experiments, indicating that it did
relatively well with poorly performing topics.

In Sections 4.1 - 4.4, we discuss detailed experimental
results on the four benchmark test collections. In each of
Tables 1 - 6, one set of experiments was conducted with
stemming and the other without. Best scores in each evaluation
metric are highlighted in bold fonts. Section 4.5 presents
our observation about the impact of query verbosity on the
proposed methods' effectiveness.

4.1 TREC 2 Topics on Disk 3 

Table 1 shows results from experiments on disk 3. In
TREC 2 topics, each query was described using a verbose list
of concepts (good keywords). With these manually picked
concept terms, which are overall quite precise in defining the
topic, LICos (least information with cosine similarity)
outperformed all the other methods in every evaluation metric
we used. Stemming appeared to further improve LICos's
effectiveness. Overall, BM25 also performed very well and
was second only to LICos in most cases, followed closely by
LIB*LIF. Most proposed methods based on least information,
especially LIB*LIF and LIB+LIF, outperformed ordinary
TF*IDF and TF_N*IDF (with length normalization of
TF) by a good margin.

[Figure 2: plots omitted. Panels: TREC 2 Experiments, TREC 4
Experiments, TREC 7 Experiments, TREC 2005 HARD/Robust.
x-axis in each panel: gMAP, MAP, P10, nDCG, Rpr; one series
per retrieval method.]

Figure 2: Overview of Experimental Results (with
stemming). The x-axis shows evaluation metrics. The y-axis is
the relative performance score in each metric as a ratio to the
TF_N*IDF baseline. TF_N*IDF scores are always 1 as
the baseline. A score of 2, for example, indicates it
is twice the baseline score.

4.2 TREC 4 Topics on Disks 2&3 

Table 2 shows results from TREC 4 experiments on disks 2
& 3. Again, LICos continued to dominate best scores,
especially when stemming was used. TREC 4 topics only had
descriptions written in natural language sentences. Stemming
improved LICos effectiveness but slightly degraded BM25
performance.

While the two had very close scores in several metrics,
LICos was consistently better than BM25 in terms of gMAP
(geometric averaging MAP) and P10. The evaluation metric
gMAP is biased toward poorly performing queries (hard
tasks). LICos appeared to perform better on difficult
topics than BM25 did to achieve a higher gMAP. We shall
see later that most of the proposed methods performed well on
TREC 2005 HARD/Robust topics, which were considered
difficult topics in TRECs.

4.3 TREC 7 Topics on Disks 4&5 

In TREC 2 and TREC 4 experiments, we used two different
fields/sources, namely concepts and description, to
form long queries. In TREC 7 experiments, we used the
title field to examine the effectiveness of the proposed
methods with short queries. Table 3 shows results from these
experiments, in which BM25 achieved slightly better scores
in P10 and nDCG_10, which favor early retrieval of relevant
documents. However, with short queries based on title,
LICos performed much better than BM25 did in terms of
gMAP, which is biased toward poorly performing topics. This
again indicates a potential advantage of the proposed
methods in search tasks that may have been challenging to
traditional methods. The other proposed methods such as LIB,
LIB+LIF and LIB*LIF came closely below BM25 but
consistently outperformed TF*IDF methods by a large margin
in each evaluation metric.

Method      gMAP    MAP     P10     nDCG    R_PR

Concept-only Search Without Stemming
BM25        0.288   0.407   0.597   0.504   0.451
TF*IDF      0.090   0.187   0.127   0.0911  0.182
TF_N*IDF    0.125   0.300   0.328   0.241   0.304
LIB         0.223   0.348   0.516   0.417   0.389
LIF         0.125   0.294   0.309   0.251   0.294
LIB+LIF     0.236   0.357   0.545   0.434   0.399
LIB*LIF     0.240   0.361   0.562   0.446   0.402
LICos       0.301   0.413   0.635   0.523   0.464

Concept-only Search With Stemming
BM25        0.281   0.399   0.565   0.488   0.442
TF*IDF      0.0711  0.164   0.132   0.0822  0.160
TF_N*IDF    0.110   0.282   0.310   0.238   0.286
LIB         0.173   0.313   0.467   0.374   0.352
LIF         0.064   0.278   0.313   0.244   0.280
LIB+LIF     0.162   0.331   0.494   0.406   0.370
LIB*LIF     0.185   0.341   0.508   0.416   0.372
LICos       0.309   0.423   0.659   0.554   0.477

Table 1: TREC 2 Concept-only Retrieval (Disk 3)

4.4 TREC 2005 HARD/Robust 

Experiments on the earlier TREC collections above showed 
the proposed methods, especially the LICos method, per- 
formed very competitively and in many cases outperformed 
a well-tuned Okapi BM25. Now we discuss experiments on 
the more recent TREC 2005 HARD/Robust collection, in 
which 50 topics are considered difficult retrieval tasks. We 
used title, description, and title+description as queries in 
the experiments. 

Table 4 shows retrieval performance using the topic title
field for query representation. The proposed methods,
especially LIB and LICos, achieved the best results in terms
of gMAP, MAP, and Rpr. BM25 and TF*IDF, without stemming,
performed slightly better in P10 and nDCG10. Overall, the
proposed methods accounted for most of the best results,
especially when terms were stemmed.

When we used topic descriptions for query representation, as
shown in Table 5, the proposed methods outperformed BM25 and
TF*IDF methods across all metrics. In particular, LIB,
LIB+LIF, and LICos produced very competitive results.

When both title and description fields were used (combined)
for queries, the proposed methods demonstrated an even larger
advantage over BM25 and TF*IDF, as shown in Table 6. Whereas
LIB, LIB+LIF, and LIB*LIF all outperformed the classic
methods, LICos (with stemming) achieved a score roughly 20%
higher than that of BM25 in every metric.

Method       gMAP     MAP      P10      nDCG     Rpr
Desc-only Search Without Stemming
BM25         0.155    0.318    0.515    0.409    0.380
TF*IDF       0.0178   0.108    0.141    0.0674   0.117
TF_N*IDF     0.0212   0.156    0.145    0.0914   0.163
LIB          0.0327   0.126    0.216    0.117    0.133
LIF          0.0052   0.135    0.139    0.086    0.141
LIB+LIF      0.0288   0.133    0.198    0.120    0.143
LIB*LIF      0.031    0.142    0.201    0.132    0.156
LICos        0.191    0.295    0.536    0.393    0.376
Desc-only Search With Stemming
BM25         0.190    0.316    0.501    0.394    0.370
TF*IDF       0.0122   0.122    0.176    0.0728   0.126
TF_N*IDF     0.0584   0.155    0.129    0.0853   0.160
LIB          0.0282   0.0968   0.175    0.0905   0.105
LIF          0.00515  0.124    0.113    0.0768   0.125
LIB+LIF      0.0187   0.113    0.210    0.117    0.128
LIB*LIF      0.0235   0.123    0.202    0.125    0.135
LICos        0.217    0.304    0.559    0.403    0.391

Table 2: TREC 4 Desc-only Retrieval (Disks 2&3)

TREC 2005 HARD/Robust topics represent difficult information
needs, for which query specification is challenging. The
proposed methods appeared to perform better with these tougher
tasks, as suggested by the higher gMAP scores in earlier
experiments. The methods also performed very competitively
with long queries (concepts and descriptions). Overall,
stemming improved the proposed methods' effectiveness.

Note that in all experiments, the proposed ranking methods
based on least information were used without any tuning, and
we did not use additional data sources for query expansion.
Although our results remain very competitive with results
reported in TREC, this is not a fair comparison because
participating systems in TREC were often trained and tuned,
sometimes with additional data. In the TREC 2005 Robust track,
for example, additional resources such as WordNet and
Wikipedia were reportedly used to boost results [15].

4.5 Impact of Query Verbosity 

We observed that query verbosity had an impact on the proposed
methods' retrieval effectiveness. With longer, more verbose
queries, methods such as LICos, LIB+LIF, and LIB*LIF appeared
to outperform baseline methods by a greater margin. In TREC'05
experiments, for example, LICos with queries based on the
description field produced P10 and nDCG10 scores nearly 30%
higher than those based on title queries (see Figure 3). The
improvement was much larger than that of BM25. With verbose
queries, having good terms (e.g., using the concepts field and
adding title to description) for query representation also
appeared to strengthen the proposed methods' advantage over
BM25 and TF*IDF.
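
The size of this effect can be checked directly from the
reported scores. The sketch below assumes that the comparison
refers to the with-stemming rows of Tables 4 and 5 (title-only
and desc-only runs on TREC'05); the helper name relative_gain
is introduced here only for illustration.

```python
def relative_gain(baseline, improved):
    """Fractional change when moving from `baseline` to `improved`."""
    return (improved - baseline) / baseline


# (P10, nDCG) scores, assumed to be the with-stemming rows of
# Table 4 (title-only) and Table 5 (desc-only).
scores = {
    "LICos": {"title": (0.401, 0.295), "desc": (0.518, 0.377)},
    "BM25": {"title": (0.381, 0.273), "desc": (0.409, 0.316)},
}

gains = {}
for method, runs in scores.items():
    gains[method] = [
        relative_gain(t, d) for t, d in zip(runs["title"], runs["desc"])
    ]
    print(method, [f"{g:.1%}" for g in gains[method]])
```

Under this assumption, moving from title to description
queries raises LICos by roughly 29% in P10 and 28% in nDCG,
compared with roughly 7% and 16% for BM25.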

5. RELATED WORK 



Method       gMAP     MAP      P10      nDCG     Rpr
Title-only Search Without Stemming
BM25         0.0682   0.242    0.482    0.337    0.360
TF*IDF       0.0334   0.113    0.219    0.107    0.170
TF_N*IDF     0.0129   0.087    0.188    0.0844   0.172
LIB          0.0653   0.236    0.349    0.250    0.282
LIF          0.012    0.0813   0.150    0.0803   0.133
LIB+LIF      0.0665   0.248    0.411    0.305    0.331
LIB*LIF      0.0662   0.247    0.429    0.317    0.334
LICos        0.173    0.251    0.466    0.316    0.346
Title-only Search With Stemming
BM25         0.0681   0.242    0.479    0.346    0.374
TF*IDF       0.0295   0.099    0.190    0.0963   0.151
TF_N*IDF     0.0150   0.079    0.228    0.0931   0.161
LIB          0.0615   0.215    0.299    0.211    0.265
LIF          0.0110   0.0744   0.159    0.0816   0.132
LIB+LIF      0.066    0.226    0.415    0.287    0.325
LIB*LIF      0.0662   0.229    0.420    0.311    0.327
LICos        0.162    0.232    0.466    0.323    0.347

Table 3: TREC 7 Title-only Retrieval (Disks 4&5)



Figure 3: Retrieval effectiveness vs. query verbosity
(TREC'05). X denotes query verbosity, ranging from title-only,
desc-only, to title+desc query representations. Y is retrieval
performance in terms of P10 and nDCG10.



Term probability distribution analysis has been an important
part of information retrieval modeling. Term frequency and
document frequency are basic examples of these frequency
(probability) distributions. While term frequency (TF) may
indicate the degree of a document's association with a term,
inverse document frequency (IDF) is a manifestation of a
term's specificity, which is key to determining the term's
value for weighting and relevance ranking [10]. The two
quantities we developed from the proposed least information
theory, namely LI Binary (LIB) and LI Frequency (LIF), can be
related to IDF and TF, though their formulations are very
different.

Method       gMAP     MAP      P10      nDCG     Rpr
Title-only Search Without Stemming
BM25         0.172    0.278    0.416    0.271    0.303
TF*IDF       0.174    0.282    0.410    0.301    0.301
TF_N*IDF     0.0823   0.194    0.231    0.147    0.210
LIB          0.192    0.309    0.409    0.280    0.322
LIF          0.0933   0.226    0.228    0.154    0.227
LIB+LIF      0.195    0.301    0.402    0.273    0.326
LIB*LIF      0.194    0.300    0.384    0.269    0.330
LICos        0.225    0.301    0.361    0.282    0.340
Title-only Search With Stemming
BM25         0.166    0.263    0.381    0.273    0.296
TF*IDF       0.160    0.262    0.360    0.281    0.285
TF_N*IDF     0.056    0.175    0.197    0.118    0.191
LIB          0.194    0.298    0.388    0.246    0.316
LIF          0.0727   0.186    0.216    0.124    0.195
LIB+LIF      0.186    0.284    0.410    0.278    0.313
LIB*LIF      0.186    0.283    0.406    0.274    0.315
LICos        0.214    0.283    0.401    0.295    0.321

Table 4: TREC'05 Title-only Retrieval (AQUAINT)

IDF (the negative log of a term's document frequency ratio,
$-\ln \frac{n_i}{N}$) resembles Shannon's entropy formula, and
several works have attempted to justify IDF from an
information-theoretic view [18]. While it has been shown that
a term's IDF is equivalent to the mutual information between
the term and the collection [24], the probabilistic retrieval
framework provides an important theoretical ground for IDF
weights [18]. Mutual information can be interpreted as
relative entropy that quantifies the difference between the
joint probabilities and product probabilities of two random
variables [5]. Further development of notions around
information-theoretic entropy led to theories such as the
maximum entropy and minimum (mutual) information principles,
providing important guidance to inferential statistics for
retrieval modeling [9, 25, 11, 2].
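
As a small illustration of IDF read as an information
quantity, the sketch below computes the negative log of the
fraction of documents that contain a term. The natural
logarithm, the absence of smoothing, and the collection size
are assumptions for the example rather than the settings used
in the experiments.

```python
import math


def idf(doc_freq, num_docs):
    """IDF as log(num_docs / doc_freq), i.e. -log(doc_freq / num_docs)."""
    return math.log(num_docs / doc_freq)


NUM_DOCS = 1_000_000  # hypothetical collection size

# Document frequencies from a one-document term to a term in every document.
for doc_freq in (1, 1_000, 500_000, 1_000_000):
    print(doc_freq, round(idf(doc_freq, NUM_DOCS), 3))
```

A term that occurs in every document carries no information
(an IDF of zero), whereas the value for a term that occurs in
a single document grows with the logarithm of the collection
size.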

IDF can also be transformed into Kullback-Leibler (KL)
information between term probability distributions in a
document and in the collection [1], similar to the modeling of
LIB in this work. KL divergence (relative entropy) measures
information for discrimination between two probability
distributions by quantifying the entropy change in a
non-symmetric manner [12]. The non-symmetry of KL divergence
is due to the assumption that one of the two distributions is
considered closer to the ultimate case and that the
information quantity should be weighted by that distribution.
As a consequence, the (absolute) amount of information differs
when only the direction of change is reversed.
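
The non-symmetry can be seen in a small numerical sketch. The
two binary distributions below are hypothetical and are chosen
only to show that swapping the arguments of KL divergence
changes the result; natural logarithms are assumed.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0; outcomes with p[i] == 0
    contribute nothing to the sum.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical (term present, term absent) distributions.
p = [0.1, 0.9]
q = [0.5, 0.5]

forward = kl_divergence(p, q)   # information weighted by p
backward = kl_divergence(q, p)  # information weighted by q
print(round(forward, 3), round(backward, 3))
```

Here KL(p || q) is about 0.368 nats while KL(q || p) is about
0.511 nats, although both describe a change between the same
two distributions.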

Classic probabilistic retrieval and language modeling
represent two different factorizations of conditional
probability distributions. While classic language models
focused on the query likelihood model, some have looked at the
likelihood of a query language model generating the document,
similar to the reasoning behind traditional probabilistic
models [14]. Research has also employed KL information in
language modeling to measure the difference between document
and query models for ranking and demonstrated strong empirical
results [13, 26]. We believe that least information can be
nicely integrated with these approaches.

The proposed least information theory (LIT) quantifies
information due to probability changes as a symmetric function
of two distributions. It extends the classic uncertainty-based
information measure to a non-linear function of entropy that
accommodates the meaning of information. Just as the
probabilistic retrieval framework and KL information offer
justification for IDF, least information provides the theory
from which LIB is developed. While IDF can be obtained from
the binary independence (probabilistic) model, LIB is derived
from a binary model of least information. They both address a
term's discriminative power or specificity. However, LIB falls
in the range of [0, 1] without normalization; it is close to 1
for extremely rare terms and for stop-words.

Method       gMAP     MAP      P10      nDCG     Rpr
Desc-only Search Without Stemming
BM25         0.204    0.275    0.336    0.239    0.290
TF*IDF       0.203    0.262    0.386    0.273    0.286
TF_N*IDF     0.0718   0.193    0.266    0.176    0.205
LIB          0.205    0.308    0.404    0.280    0.332
LIF          0.058    0.232    0.289    0.186    0.234
LIB+LIF      0.203    0.303    0.385    0.263    0.328
LIB*LIF      0.231    0.300    0.354    0.249    0.325
LICos        0.243    0.308    0.401    0.292    0.338
Desc-only Search With Stemming
BM25         0.209    0.293    0.409    0.316    0.315
TF*IDF       0.202    0.266    0.350    0.270    0.283
TF_N*IDF     0.0663   0.197    0.243    0.159    0.209
LIB          0.232    0.353    0.460    0.318    0.377
LIF          0.0624   0.236    0.293    0.195    0.243
LIB+LIF      0.275    0.351    0.505    0.332    0.387
LIB*LIF      0.262    0.337    0.477    0.324    0.370
LICos        0.259    0.330    0.518    0.377    0.371

Table 5: TREC'05 Desc-only Retrieval (AQUAINT)

Document length normalization is implicit in the proposed
model when probabilities are calculated, as in probabilistic
and language modeling approaches [17]. In establishing term
probabilities in documents, we used maximum likelihood
estimates based solely on raw frequency counts. Research in
language modeling has studied related distributions and
applied various smoothing methods to significantly improve
probability estimates and retrieval effectiveness [30].
Smoothing may also be useful for the further development of
least information modeling for IR.
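
The estimates described above can be sketched as follows.
Maximum likelihood estimation divides raw counts by the length
of the text, which is where length normalization enters.
Jelinek-Mercer interpolation is shown only as one well-known
smoothing method from the language modeling literature; it is
an assumption for illustration, not a method evaluated in this
work, and the toy texts and the value of lam are hypothetical.

```python
from collections import Counter


def mle_probs(tokens):
    """Maximum likelihood term probabilities: raw count / text length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def jelinek_mercer(term, doc_probs, coll_probs, lam=0.5):
    """Interpolate document and collection estimates with weight `lam`."""
    return (1 - lam) * doc_probs.get(term, 0.0) + lam * coll_probs.get(term, 0.0)


# Hypothetical toy document and collection.
doc = "least information theory for information retrieval".split()
collection = doc + "classic retrieval models use term frequency".split()

p_doc = mle_probs(doc)
p_coll = mle_probs(collection)

print(p_doc["information"])                        # 2 of 6 tokens
print(p_doc.get("frequency", 0.0))                 # unseen in the document
print(jelinek_mercer("frequency", p_doc, p_coll))  # non-zero after smoothing
```

The unsmoothed estimate assigns zero probability to any term
absent from the document, which is the situation smoothing is
meant to address.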

6. CONCLUSION 

In this work, we proposed the least information theory 
(LIT) to quantify the meaning of information in probabilities 
by extending the classic notion of entropy to accommodate 
a non-linear relation between information and uncertainty. 
The new formulation displays several important character- 
istics such as unit information regardless of log base, func- 
tional symmetry with regard to two distributions, and finite 
information in extreme cases. 

Applying the least information theory in information
retrieval, we developed two quantities for document
representation based on a term's probability distributions in
a document vs. in the collection. In particular, LI Binary
(LIB) quantifies least information due to the binary
occurrence of a term in a document, i.e., whether the term
appears in the document or not. LI Frequency (LIF), on the
other hand, measures the amount of least information based on
the likelihood of drawing a term from a bag of words. While
LIB and LIF are similar in spirit to classic IDF and TF
respectively, their formulations are very different. Three
additional quantities, namely LIB+LIF, LIB*LIF, and LICos,
were developed for term weighting and document ranking.

Method       gMAP     MAP      P10      nDCG     Rpr
Title+Desc Search Without Stemming
BM25         0.226    0.297    0.458    0.329    0.338
TF*IDF       0.210    0.274    0.386    0.289    0.293
TF_N*IDF     0.0886   0.192    0.258    0.162    0.201
LIB          0.260    0.329    0.445    0.336    0.355
LIF          0.102    0.221    0.261    0.186    0.225
LIB+LIF      0.264    0.331    0.447    0.320    0.360
LIB*LIF      0.263    0.328    0.425    0.320    0.359
LICos        0.264    0.331    0.490    0.394    0.397
Title+Desc Search With Stemming
BM25         0.217    0.291    0.458    0.362    0.332
TF*IDF       0.202    0.267    0.349    0.271    0.286
TF_N*IDF     0.0945   0.185    0.228    0.142    0.198
LIB          0.261    0.328    0.483    0.337    0.381
LIF          0.145    0.223    0.284    0.191    0.242
LIB+LIF      0.271    0.336    0.510    0.354    0.387
LIB*LIF      0.272    0.335    0.493    0.366    0.387
LICos        0.278    0.340    0.555    0.439    0.421

Table 6: TREC'05 Title+Desc Runs (AQUAINT)

Ad hoc retrieval experiments on four benchmark TREC 
collections showed that the proposed methods performed 
very competitively and in most cases outperformed classic 
TF*IDFs and a well-tuned BM25. LIT-based methods such 
as LICos and LIB+LIF were particularly effective with good 
query terms (e.g., using concepts), verbose queries (e.g., us- 
ing description + title), and in difficult tasks (e.g., on TREC 
2005 HARD/Robust collection). Note that none of the pro- 
posed methods based on least information involved training 
or tuning. For Okapi BM25, on the other hand, we adopted 
parameters that had demonstrated strong performances in 
existing experiments. 

Despite the proposed methods' superior performance, the
improvement over existing methods is not the main point. Least
information offers a means to quantify the meaning of
information and presents a new way of thinking about modeling
information processes. While other IR models can be derived
from LIT, the least information measure can also be used with
existing frameworks. For example, it can be used to match
statistical distributions such as document and query language
models, for which KL information has been used. Given the
potential demonstrated in this work, we believe further
research on least information modeling for IR is promising.